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Abstract 



The ^-regularized maximum likelihood estimation problem has recently become a topic of 
great interest within the machine learning, statistics, and optimization communities as a method 
for producing sparse inverse covariance estimators. In this paper, a proximal gradient method 
(G-ISTA) for performing ^-regularized covariance matrix estimation is presented. Although 
numerous algorithms have been proposed for solving this problem, this simple proximal gradient 
method is found to have attractive theoretical and numerical properties. G-ISTA has a linear rate 
of convergence, resulting in an C(log e) iteration complexity to reach a tolerance of e. This paper 
gives eigenvalue bounds for the G-ISTA iterates, providing a closed-form linear convergence rate. 
The rate is shown to be closely related to the condition number of the optimal point. Numerical 
convergence results and timing comparisons for the proposed method are presented. G-ISTA is 
shown to perform very well, especially when the optimal point is well-conditioned. 

1 Introduction 

Datasets from a wide range of modern research areas are increasingly high dimensional, 
which presents a number of theoretical and practical challenges. A fundamental example 
is the problem of estimating the covariance matrix from a dataset of n samples {X^}f =1 , 




drawn i.i.d from a p-dimensional, zero-mean, Gaussian distribution with covariance matrix 
E G S p ++ , X® ~ JVp(0,£), where S p ++ denotes the space of p x p symmetric, positive 
definite matrices. When n > p the maximum likelihood covariance estimator £ is the 




sample covariance matrix S — ^ Y^=i X^X^ T . 



A problem however arises when n < p, due 
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to the rank-deficiency in S. In this sample deficient case, common throughout several modern 
applications such as genomics, finance, and earth sciences, the matrix S is not invertible, 
and thus cannot be directly used to obtain a well-defined estimator for the inverse covariance 
matrix Q := X -1 . 

A related problem is the inference of a Gaussian graphical model ([331 [18]), that is, a sparsity 
pattern in the inverse covariance matrix, Q. Gaussian graphical models provide a powerful 
means of dimensionality reduction in high- dimensional data. Moreover, such models al- 
low for discovery of conditional independence relations between random variables since, for 
multivariate Gaussian data, sparsity in the inverse covariance matrix encodes conditional 
independences. Specifically, if X = (A^)? =1 e M. p is distributed as X ~ A^,(0,S), then 
(£ -1 )y = Qij = -<=>■ Xi _LL Xj\{Xk}k^i,j, where the notation A _LL B\C denotes the 
conditional independence of A and B given the set of variables C (see [331 HE])- If a dataset, 
even one with n 3> p is drawn from a normal distribution with sparse inverse covariance 
matrix Q, the inverse sample covariance matrix S 1-1 will almost surely be a dense matrix, 
although the estimates for those which are equal to may be very small in magnitude. 
As sparse estimates of Q are more robust than S" 1 , and since such sparsity may yield easily 
interpretable models, there exists significant impetus to perform sparse inverse covariance 
estimation in very high dimensional low sample size settings. 

Banerjee et al. [1] proposed performing such sparse inverse covariance estimation by solving 
the ^-penalized maximum likelihood estimation problem, 




where p > is a penalty parameter, (S,Q) = Tr(S'O), and H©^ = Ylij F° r P > 0? 

Problem pj) is strongly convex and hence has a unique solution, which lies in the positive 
definite cone due to the logdet term, and is hence invertible. Moreover, the l\ penalty 
induces sparsity in Q*, as it is the closest convex relaxation of the — 1 penalty, ||O|| = 
Ylij ^(®«i 7^ O)) where !(•) is the indicator function [5]. The unique optimal point of problem 
0, 0*, is both invertible (for p > 0) and sparse (for sufficiently large p), and can be used 
as an inverse covariance matrix estimator. 

In this paper, a proximal gradient method for solving Problem ([T]) is proposed. The resulting 
"graphical iterative shrinkage thresholding algorithm", or G-ISTA, is shown to converge at a 
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linear rate to 0*, that is, its iterates t are proven to satisfy 



\\e t+1 -e;\\ F <s\\e t -e;\\ F , (2) 

for a fixed worst-case contraction constant s G (0,1), where denotes the Frobenius 
norm. The convergence rate s is provided explicitly in terms of S and p, and importantly, 
is related to the condition number of 0*. 

We also note that methods outside the penalized likelihood framework have been proposed 
in the context of graphical models. In particular graphical model estimation and related 
problems has also be undertaken either in the Bayesian or testing frameworks. The reader 
is referred to the theoretical work of PHI EH HU H3 H2J [2H], among others, for greater detail. 

The paper is organized as follows. Section [2] describes prior work related to solution of 
Problem ([I]). The G-ISTA algorithm is formulated in Section [3j Section [4] contains the 
convergence proofs of this algorithm, which constitutes the primary mathematical result of 
this paper. Numerical results are presented in Section [5j and concluding remarks are made 
in Section |6l 



2 Prior Work 

While several excellent general convex solvers exist (for example, [12] and [1]), these are 
not always adept at handling high dimensional problems (i.e., p > 1000). As many modern 
datasets have several thousands of variables, numerous authors have proposed efficient algo- 
rithms designed specifically to solve the ^-penalized sparse maximum likelihood covariance 
estimation problem (Tip _ 

These can be broadly categorized as either primal or dual methods. Following the literature, 
we refer to primal methods as those which directly solve Problem (jl]), yielding a concentration 
estimate. Dual methods PQ yield a covariance matrix by solving the constrained problem, 

minimize — log det(>S + U) — p 

u&rp x p ' (3) 

subject to Halloo < p, 

where the primal and dual variables are related by = (S + U)^ 1 . Both the primal and 
dual problems can be solved using block methods (also known as "row by row" methods), 
which sequentially optimize one row/column of the argument at each step until convergence. 
The primal and dual block problems both reduce to ^-penalized regressions, which can be 
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solved very efficiently. 



2.1 Dual Methods 

A number of dual methods for solving Problem ([!]) have been proposed in the literature. 
Banerjee et al. p] consider a block coordinate descent algorithm to solve the block dual 
problem, which reduces each optimization step to solving a box-constrained quadratic pro- 
gram. Each of these quadratic programs is equivalent to performing a "lasso" (^-regularized) 
regression. Friedman et al. pj] iteratively solve the lasso regression as described in [lj, but 
do so using coordinate-wise descent. Their widely used solver, known as the graphical lasso 
(glasso) is implemented on CRAN. Global convergence rates of these block coordinate meth- 
ods are unknown. D'Aspremont et al. [U] use Nesterov's smooth approximation scheme, 
which produces an e-optimal solution in 0(1/ e) iterations. A variant of Nesterov's smooth 
method is shown to have a 0(1/ y/e) iteration complexity in [20, 2 1 J . 

2.2 Primal Methods 

Interest in primal methods for solving Problem Q has been growing for many reasons. One 
important reason stems from the fact that convergence within a certain tolerance for the dual 
problem does not necessarily imply convergence within the same tolerance for the primal. 

Yuan and Lin [SH] use interior point methods based on the max-det problem studied in [32J . 
Yuan [37] use an alternating-direction method, while Scheinberg et al. [30] proposes a similar 
method and show a sublinear convergence rate. Mazumder and Hastie |23j consider block- 
coordinate descent approaches for the primal problem, similar to the dual approach taken 
in [TT]. Mazumder and Agarwal (22] also solve the primal problem with block-coordinate 
descent, but at each iteration perform a partial as opposed to complete block optimization, 
resulting in a decreased computational complexity per iteration. Convergence rates of these 
primal methods have not been considered in the literature and hence theoretical guarantees 
are not available. Hsieh et al. [16] propose a second-order proximal point algorithm, called 
QUIC, which converges superlinearly locally around the optimum. 

3 Methodology 

In this section, the graphical iterative shrinkage thresholding algorithm (G-ISTA) for solving 
the primal problem ([I]) is presented. A rich body of mathematical and numerical work 



4 



exists for general iterative shrinkage thresholding and related methods; see, in particular, 
El E3 E51 EH El]. A brief description is provided here. 



3.1 General Iterative Shrinkage Thresholding (ISTA) 

Iterative shrinkage thresholding algorithms (ISTA) are general first-order techniques for solv- 
ing problems of the form 

minimize F(x) := f(x) + g(x), (4) 

where X is a Hilbert space with inner product (-, •) and associated norm ||-||, / : X — > R is 
a continuously differentiable, convex function, and g : X — > K is a lower semi-continuous, 
convex function, not necessarily smooth. The function / is also often assumed to have 
Lipschitz-continuous gradient V/, that is, there exists some constant L > such that 



\\Vf(x 1 )-Vf(x 2 )\\<L\\x 1 -x 2 \\ (5) 



for any x±, x 2 G X. 



For a given lower semi-continuous convex function g, the proximity operator of g, denoted 
by prox 9 : X — > X, is given by 

prox g (x) = argmin ^g(y) + ^\\x- yfj , (6) 

It is well known (for example, [S]) that x* G X is an optimal solution of problem Q if and 
only if 



x 



prox^x* - CV/(z*)) (7) 



for any £ > 0. The above characterization suggests a method for optimizing problem §4§ 
based on the iteration 

x t+l = prox ?t9 (x t - CtV/(z t )) (8) 

for some choice of step size, Q. This simple method is referred to as an iterative shrinkage 
thresholding algorithm (ISTA). For a step size ( t < j, the ISTA iterates x t are known to 
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satisfy 



F(x t )-F(x*)~0[^)yt, (9) 



where x* is some optimal point, which is to say, they converge to the space of optimal points 
at a sublinear rate. If no Lipschitz constant L for V/ is known, the same convergence result 
still holds for ( t chosen such that 

f(x t+ i) <Q Ct (x t+1 ,x t ), (10) 
where (•, •) : X x X — > R is a quadratic approximation to /, defined by 

Qcfo y) = f(y) + ( x - y> v /(y)} + ^ Ik - y\\ 2 ■ (n) 

See [3] for more details. 

3.2 Graphical Iterative Shrinkage Thresholding (G-ISTA) 



The general method described in Section 3.1 can be adapted to the sparse inverse covariance 
estimation Problem ([I]). Using the notation introduced in Problem Q, define /, g : S>+ + — > K 
by f(X) = — log det(X) + (S, X) and g(X) = p WXW^ Both are continuous convex functions 
defined on Although the function V/(X) — S — X" 1 is not Lipschitz continuous over 
it is Lipschitz continuous within any compact subset of S>+ + (See Lemma [2] of the 
Supplemental section). 

Lemma 1 ([HOT- The solution of Problem Q, 0*, satisfies al r< ©* d for 

« = irarV- ^-n{ p -^^.-r}, (12) 

Ip1I 2 +pp I p J 

and 

f min{l T | ^ 1 1 1, (p - p^pa) H^ 1 ^- (p - l)a} if 5 G 

7 = S (I 3 ) 
I 21 T |(5+f/)- 1 |l-Tr((S+f/)- 1 ) otherwise, 

where / denotes the pxp dimensional identity matrix and 1 denotes the p-dimensional vector 
of ones. 

Note that / + g as defined is a continuous, strongly convex function on Moreover, by 
Lemma [2] of the supplemental section, / has a Lipschitz continuous gradient when restricted 
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to the compact domain al ^ ^ hi. Hence, / and g as defined meet the conditions 
described in Section 13.11 

The proximity operator of p \\X\\ 1 for p > is the soft-thresholding operator, r\ p : R pxp — > 
]R pxp , defined entrywise by 

[ Vp (X)} ij = sgn(X i>j )(\X itj \-p) + , (14) 
where for some jgR, (x) + := max(i, 0) (see [8]). Finally, the quadratic approximation Q^ t 



of /, as in equation (11), is given by 



Q fi (e t+1 ,e t ) = -io g det(e t ) + (5,6*) + (e m - e t ,s - e,- 1 ) + ||e m - e t ||J. (15) 

The G-ISTA algorithm for solving Problem is given in Algorithm [TJ As in |3j, the 
algorithm uses a backtracking line search for the choice of step size. The procedure terminates 
when a pre-specified duality gap is attained. The authors found that an initial estimate of Bo 
satisfying [©q]„ = (Su + p)^ 1 works well in practice. Note also that the positive definite check 
of 6 i+ i during Step (1) of Algorithm [I] is accomplished using a Cholesky decomposition, and 
the inverse of Qt+i is computed using that Cholesky factor. 

Algorithm 1: G-ISTA for Problem 

input : Sample covariance matrix S 1 , penalty parameter p, tolerance e, backtracking constant c € (0, 1), 

initial step size Cl,o^ initial iterate Go- Set A := 2s. 
while A > e do 

(1) Line search: Let Q be the largest element of {c J Ct,o}j=o,i,... so that for 
t+ i = rj^p (0 t — ( t (S — O^ -1 )), the following are satisfied: 

e t+1 ^o and /(9t+i) <Q Ct (e t+ i,e t ), 



for Qq ± as defined in ( 15 ) 



(2) Update iterate: Q t +i = Vc tP ( e * ~ Ct(S - 97/ )) 

(3) Set next initial step, Ct+i.o- See Section 

(4) Compute duality gap: 



3.2.1 



A = - logdet(S + Ut+i) - p - log det 9 t+1 + (S, 8) + p ||0 t+1 1^ , 



where (U t +i)i,j = min{max{([97/|^]i,j - S l . j ),-p}, p}. 
end 

output: e-optimal solution to problem ([I]), O* = &t+i- 
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3.2.1 Choice of initial step size, Co 



Each iteration of Algorithm [T] requires an initial step size, Co- The results of Section [4] 
guarantee that any ( < A m i n (6 t ) 2 will be accepted by the line search criteria of Step 1 in 
the next iteration. However, in practice this choice of step is overly cautious; a much larger 
step can often be taken. Our implementation of Algorithm [T] chooses the Barzilai-Borwein 
step P]. This step, given by 

Tr((6 w -9 f )(9 t+1 -e f )) 

is also used in the SpaRSA algorithm [35], and approximates the Hessian around Qt+i- If a 
certain number of maximum backtracks do not result in an accepted step, G-ISTA takes the 
safe step, A min (0 t ) 2 . Such a safe step can be obtained from A max (9j" 1 ), which in turn can be 
quickly approximated using power iteration. 



4 Convergence Analysis 

In this section, linear convergence of Algorithm [T] is discussed. Throughout the section, Q t 
(t — 1, 2, . . . ) denote the iterates of AlgorithmjTJ and 6* the optimal solution to Problem ([!]) 
for p > 0. The minimum and maximum eigenvalues of a symmetric matrix A are denoted 
by A m j n (A) and Amax(^4), respectively. 

Theorem 1. Assume that the iterates Qt of Algorithm^ satisfy al ^ Qt bl, Vt for some 
fixed constants < a < b. If ( t < a 2 ,Vt, then 

\\Qt+i - Q* P \\ F < max 

Furthermore, 

1. The step size ( t which yields an optimal worst-case contraction bound s(( t ) is C — 

2 

o- 2 +b" 2 ■ 

2. The optimal worst- case contraction bound corresponding to ( = a _ 2 2 +h _ 2 is given by 

Proof. A direct proof is given in the appendix. Note that linear convergence of proximal 
gradient methods for strongly convex objective functions in general has already been proven 



b 2 



\ Q t-%\\ F 



(17) 



S 



(see Supplemental section). □ 

It remains to show that there exist constants a and b which bound the eigenvalues of Gt, Vi. 
The existence of such constants follows directly from Theorem [TJ as Q t lie in the bounded 
domain {Q e S+ + : /(G) + g(Q) < /(Go) + fi'(Go)}, for all t. However, it is possible to 
specify the constants a and b to yield an explicit rate; this is done in Theorem [2j 

Theorem 2. Let p > 0, define a and f3 as in Lemma\l\ and assume ( t < a 2 ,Vt. Then 
the iterates G t of Algorithm [ij satisfy al ^ Q t ^ b'l,Vt, with b' = ||©*|| 2 + ||G - G* 
P+ y/p(P + a). 



, < 

P\\F — 



Proof. See the Supplementary section. □ 



Importantly, note that the bounds of Theorem [2] depend explicitly on the bound of Q*, as 
given by Lemma [TJ These eigenvalue bounds on Qt+i, along with Theorem [TJ provide a 
closed form linear convergence rate for Algorithm [T] This rate depends only on properties 
of the solution. 

Theorem 3. Let a and f3 be as in Lemma^ Then for a constant step size ( t '■= C < o 2 , 
the iterates of Algorithm^ converge linearly with a rate of 

*Q =l - *w?m=w <l (18) 

Proof. By Theorem [2j for ( < a 2 , the iterates Q t satisfy 

ai±e t ±(\\e;\\ 2 + \\e -e;\\ F )i 

for all t. Moreover, since al -< G* -< (31, if al -< Go ^ 01 (for instance, by taking 
Go = (S + pl)^ 1 or some multiple of the identity) then this can be bounded as: 

||e;|| a + ||g - q;\\ f < p + ^ ||g - g;|| 2 (19) 

</3 + Vft/3-a). (20) 

Therefore, 

al^e t ^(P + ^p((3-a))l, (21) 

and the result follows from Theorem [H □ 
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Remark 1. Note that the contraction constant (equation 18) of Theorem [3] is closely related 
to the condition number of 0* 

A ma x(0p 
A mml L V a 



as 



1 — >1 > 1 -2k(Q*)- 2 . (22) 

a 2 + ((3 + ^p{(3 - a)) 2 ~ a 2 + (3 2 ~ K p> 1 } 

Therefore, the worst case bound becomes close to 1 as the conditioning number of G* in- 
creases. 

It is important to compare the above convergence results to those that have been recently 
established. In particular, the useful, recent QUIC method [16J warrants a discussion. As soon 
as the sign pattern of its iterates match that of the true optimum, the non-smooth problem 
becomes effectively smooth and the QUIC algorithm reduces to a Newton method. At this 
point, QUIC converges quadratically; however, this is a very local property, and no overall 
complexity bounds have been specified for QUIC. This can be contrasted with our results, 
which take advantage of existing bounds on the optimal solution to yield global convergence 
(i.e., that we can always specify a starting point which meets our conditions). We also note 
that convergence rates have not been established for the glasso. However, QUIC and glasso 
can all be very fast in appropriate settings. Each brings a useful addition to the literature by 
taking advantage of different structural elements (block structure for glasso, second order 
approaches for QUIC, and conditioning bounds for G-ISTA). We feel there is no silver bullet; 
each method outperforms the others in certain settings. 



5 Numerical Results 



In this section, we provide numerical results for the G-ISTA algorithm. In Section 5.2, the 



theoretical results of Section [4] are demonstrated. Section 5J3 compares running times of the 



G-ISTA, glasso [TTJ, and QUIC [TB] algorithms. All algorithms were implemented in C++, 
and run on an Intel il - 2600& 3.40GHz x 8 core with 16 GB of RAM. 



10 



5.1 Synthetic Datasets 



Synthetic data for this section was generated following the method used by [2T| 122] . For a 
fixed p, a p dimensional inverse covariance matrix Q was generated with off-diagonal entries 
drawn i.i.d from a uniform(— 1, 1) distribution. These entries were set to zero with some 
fixed probability (in this case, either 0.97 or 0.85 to simulate a very sparse and a somewhat 
sparse model). Finally, a multiple of the identity was added to the resulting matrix so that 
the smallest eigenvalue was equal to 1. In this way, Q was insured to be sparse, positive 
definite, and well-conditioned. Datsets of n samples were then generated by drawing i.i.d. 
samples from a J\f p (0, distribution. For each value of p and sparsity level of Q, n — 1.2p 
and n = 0.2p were tested, to represent both the n < p and n > p cases. 





P 


0.03 


0.06 


0.09 


0.12 


problem 


algorithm 


time/iter 


time/iter 


time/iter 


time/iter 


p = 2000 
n = 400 
nnz(ft) = 3% 


nnz(ft*)/ K (Q*) 


27.65%/48.14 


15.08%/20.14 


7.24%/7.25 


2.39%/2.32 


glasso 


1977.92/11 


831.69/8 


604.42/7 


401.59/5 


QUIC 


1481.80/21 


257.97/11 


68.49/8 


15.25/6 


G-ISTA 


145.60/437 


27.05/9 


8.05/27 


3.19/12 


p = 2000 
n = 2400 
nnz(fi) = 3% 




14.56%/10.25 


3.11%/2.82 


0.91%/1.51 


0.11%/1.18 


glasso 


667.29/7 


490.90/6 


318.24/4 


233.94/3 


QUIC 


211.29/10 


24.98/7 


5.16/5 


1.56/4 


G-ISTA 


14.09/47 


3.51/13 


2.72/10 


2.20/8 


p = 2000 
n = 400 
nnz(ft) = 15% 




27.35%/64.22 


15.20%/28.50 


7.87%/11.88 


2.94%/2.87 


glasso 


2163.33/11 


862.39/8 


616.81/7 


48.47/7 


QUIC 


1496.98/21 


318.57/12 


96.25/9 


23.62/7 


G-ISTA 


251.51/714 


47.35/148 


7.96/28 


3.18/12 


p = 2000 
n = 2400 
nnz(ft) = 15% 




19.98%/17.72 


5.49%/4.03 


65.47%/1.36 


0.03%/1.09 


glasso 


708.15/6 


507.04/6 


313.88/4 


233.16/3 


QUIC 


301.35/10 


491.54/17 


4.12/5 


1.34/4 


G-ISTA 


28.23/88 


4.08/16 


1.95/7 


1.13/4 



Table 1: Timing comparisons for p = 2000 dimensional datasets, generated as in Section 5.1 
Above, nnz(A) is the percentage of nonzero elements of matrix A. 



5.2 Demonstration of Convergence Rates 

The linear convergence rate derived for G-ISTA in Section [4] was shown to be heavily depen- 
dent on the conditioning of the final estimator. To demonstrate these results, G-ISTA was 



run on a synthetic dataset, as described in Section [5TTj with p = 500 and n = 300. Regular- 
ization parameters of p = 0.75, 0.1, 0.125, 0.15, and 0.175 were used. Note as p increases, 0* 
generally becomes better conditioned. For each value of p, the numerical optimum was com- 
puted to a duality gap of 10~ 10 using G-ISTA. These values of p resulted in sparsity levels of 
81.80%, 89.67%, 94.97%, 97.82%, and 99.11%, respectively. G-ISTA was then run again, and 
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the Frobenius norm argument errors at each iteration were stored. These errors were plotted 
on a log scale for each value of p to demonstrate the dependence of the convergence rate on 
condition number. See Figure [TJ which clearly demonstrates the effects of conditioning. 



p 


= 0.075. k(O ) 


= 7.263 


p 


= 0.1, K(0*) = 

p 


3.9637 


p 


= 0.125, k(0*) 
p 


= 2.3581 


p 


= 0.15, k(0*) = 
P 


= 1 .6996 


p 


= 0.175, k(0*1 
P 


= 1 .3968 




~1 I I I I I I I I 

50 100 150 200 250 300 350 400 450 

iteration 



Figure 1: Semilog plot of ||0j — vs. iteration number t, demonstrating linear convergence 

rates of G-ISTA, and dependence of those rates on 



5.3 Timing Comparisons 

The G-ISTA, glasso, and QUIC algorithms were run on synthetic datasets (real datasets are 
presented in the Supplemental section) of varying p, n and with different levels of regulariza- 
tion, p. All algorithms were run to ensure a fixed duality gap, here taken to be 10 -5 . This 
comparison used efficient C++ implementations of each of the three algorithms investigated. 
The implementation of G-ISTA was adapted from the publicly available C++ implementa- 
tion of QUIC Hsieh et al. [IE]. Running times were recorded and are presented in Table [TJ 
Further comparisons are presented in the Supplementary section. 

Remark 2. The three algorithms variable ability to take advantage of multiple processors 
is an important detail. The times presented in Table [T] are wall times, not CPU times. The 
comparisons were run on a multicore processor, and it is important to note that the Cholesky 
decompositions and inversions required by both G-ISTA and QUIC take advantage of multiple 
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cores. On the other hand, the p 2 dimensional lasso solve of QUIC and p-dimensional lasso 
solve of glasso do not. For this reason, and because Cholesky factorizations and inversions 
make up the bulk of the computation required by G-ISTA, the CPU time of G-ISTA was 
typically greater than its wall time by a factor of roughly 4. The CPU and wall times of 
QUIC were more similar; the same applies to glasso. 



6 Conclusion 

In this paper, a proximal gradient method was applied to the sparse inverse covariance 
problem. Linear convergence was discussed, with a fixed closed-form rate. Numerical results 
have also been presented, comparing G-ISTA to the widely-used glasso algorithm and the 
newer, but very fast, QUIC algorithm. These results indicate that G-ISTA is competitive, 
in particular for values of p which yield sparse, well-conditioned estimators. The G-ISTA 



algorithm was very fast on the synthetic examples of Section |5.3[ which were generated from 
well-conditioned models. For poorly conditioned models, QUIC is very competitive. The 
Supplemental section gives two real datasets which demonstrate this. For many practical 
applications however, obtaining an estimator that is well-conditioned is important ([29| 134"]). 
To conclude, although second-order methods for the sparse inverse covariance estimation 
problem have recently been shown to perform well, simple first-order methods cannot be 
ruled out, as they can also be very competitive in many cases. 
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A Supplementary material 

A.l Lipschitz Continuity of Vf(X) 
Lemma 2. For any X,Y G S p ++ , 

l 2 \\X-Y\\ 2 <\\X-'-Y-\<^\\X-Y\\ 2 , 
where a = min{A min (X), A min (Y)} and b = max{A max (X), A max (F)}. 
Proof. To prove the right-hand side inequality, notice that 

X- 1 - Y- 1 = X-\Y - X)Y~ 1 . 



Thus, 



| X_1 ~ F_1 ||2 = H X_1 ( r - X ) r_1 || 2 

< wx-xwx-yw.wy-x 

= KUx-^KUy^U-YW 

- — —\\X-Y\\ 2 



a 

To prove the left inequality, note first that 



Amin 

(x) A min (y) 



Therefore, 



Y - X = X(X~ l -Y~ V )Y. 



\\X-Y\\ 2 = \\X{X-' -Y-')Y\\ 2 

< \\X\\ 2 \\X-' -Y-\\\Y\\ 2 

— A max (X)A max (y) \\X 1 — Y 1 | 

< b 2 \\X- l -Y- l \\ 2 . 



This shows that 

and concludes the proof. 
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The function V/(X) = S — X 1 is Lipschitz continuous on any compact domain, since for 
-+ 



X, Y E such that al r< X, Y ^ 6/, 



||v/(x)-v/(r)|| F = ||x- 1 -r- 1 || F 

or 

<4iiA'-m F . 

a 2 

A. 2 Proof of Theorem Q] 

We now provide the proof of Theorem 1. 

Lemma 3. Let Q t be as in Algorithm [T] and let O* be the optimal point of problem Q. 
Also, define 

b := max{Amax(6t),Amax(0p)} , a:= min {A min (G t ), A min (9*)} . 



Then 



|0 m - 6*|| F < max 



1 - 



b 2 



\^-e;\\ F 



Proof. By construction in Algorithm [TJ 

e*+i = ((©* - as - er 1 )) 

Moreover, as 0* is a fixed point of the ISTA iteration [HI Prop. 3.1], it satisfies 

©; = %p (©; - - (©;n) • 

The soft-thresholding operator rj p (-) is a proximity operator corresponding to yo || • 1 1 i - Since 
prox operators are non-expansive [8j Lemma 2.2], it follows that: 

||e m - e;|| F = \\ VCtp (e t - as - e,- 1 )) - VCtp (e; - q(s - (ep- 1 )) || F 



< ||e t - Ct(s - Qi 1 ) - (e; - c*(s - (e;)- 1 ;, „ ,.. 
= ||(e t + CA" 1 )-(0; + O(©;)- 1 )|| F 



To bound the latter expression, recall that if h : U C R n — )■ lR m is a different iable mapping, 
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with x,y G U, and cx + (1 — c)y G U for all c G [0, 1], then 

||/i(z)-%)|| < sup {\\J h (cx + (l-c)y)\\\\x-y\\} 

ce[o,i] 

where J^(-) is the Jacobian of /i. Define /i 7 : §+ + — > R p2 by 

/i 7 (X) = vec(X) + vec( 7 X- 1 ), 
where vec(-) : M. pxp — > W 2 is the vectorization operator defined by 

vec(A) = (A lj ,A 2i ,...,A p f 
with A i: the i th row of A. Note that for X G 

^=/ p2 and — = -X>*X\ 

where ® is the Kronecker product and J p 2 is the p 2 x identity matrix. Then the Jacobian 
of /i 7 is given by: 

j hl (x) = i P 2- 1 x- 1 ®x-\ 

Application of the mean value theorem to h^ t over Z tfi = vec(c© t + (1 — c)0*), c G [0, 1] 
yields 

1 1 Me*) - hM)\\ F < snp{||/ p2 -CtZ^ 1 ® z^nj ||vec(e t ) - vec(ep|| 2 

= sup {||/ p2 - c^ 1 ® z^nj ||e* - e;|| F . 

Denoting the eigenvalues of Z t)C for given values of t and c as < 71 < 72 < • • • < 7p , the 
eigenvalues of I p i - ( t Z^ <g> Z^ 1 are {1 - Ct(7i7j) _1 }fj =r B Y Weyl's inequality, 

7 P = A max (Z tiC )< max {A max (6 t ), A max (9*)} 
7i = A min (Z tiC ) > min {A min (0 t ), A min (0*)} , 
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and therefore 



A min (i p2 - Qzrj ®z^ i c ) = i-%>i-% 

A max (V - CtZ^ 1 ® Z-J) = 1 - |< 1 - |. 



Hence, 



sup 



Up 2 - Ci^< 



~ L ^ ^cHjj !> < max 



fe 2 



which completes the proof. 



It follows from Lemma [3] that Algorithm [T] converges linearly if 



St (Ct) := max 



b 2 



Ct 



Since the minimum of 



is at ( 



s(C) = max 



1 



e (0,1), vt. 

6 2 



□ 



(23) 



ffl _ 2+b -2 ; Theorem 1 follows directly from Lemma 3. It now remains to show 
that the eigenvalues of the G-ISTA iterates remain bounded in eigenvalue. A more general 
convergence result for strongly convex functions exists in the literature; this result is stated 
below. 



Theorem 4. Let f be strongly convex with convexity constant \i, and V/ be Lipschitz con- 
tinuous with constant L. Then for constant step size < £ < \, the iterates of the ISTA 
iteration (equation {x t }t>o to minimize f + g as in Q ; satisfy 



\x t+ i - x*\\ F < max{|l - C,L\ , |1 - C/i|} \\x t - x* 



\F ' 



which is to say that they converge linearly with rate max {|1 — (L\ , |1 — CHI- Furthermore, 



2 

2. The optimal worst-case contraction bound corresponding to £ = — is given by 



1. The step size which yields an optimal worst-case contraction bound is ( 

2 



S (C): = max{|l-CL|,|l-C/i|} 
2 

= 1 - 
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Proof. See [7J [26] and references therein. □ 
A.3 Proof of Theorem [2] 

In this section, the eigenvalues of Ot,Vt are bounded. To begin, the eigenvalues of t+ i : = 
©t — Ct(S — Of 1 ) are bounded. 

Lemma 4. Let < a < b be given positive constants and let Q > 0. Assume al ^ Q t ^ W- 
Then the eigenvalues of t+ i := 0>t — (t(S — O^ 1 ) satisfy: 

2v^ - CA max (5) ifa<v^<6 
A min (e t+ i) > <{ (24) 
mm (a + ^, 6 + f ) - CtAmax(S') otherwise 



and 



A, )mx (0 /+ i) < max ( a+ ^,&+ y ) - A lllin ( S). 



r 



Proof. Denoting the eigenvalue decomposition of 0t by 0t = UYU 

e t+| = e, - o(5 - ©r 1 ) 

= UTU T -Q(S- UT^U T ) 

= u (r - o(f/ T 5f/ - r- 1 )) f/ T 

Let T = diag(7x, . . . , j p ) with 7i < • • • < 7 p . By Weyl's inequality, the eigenvalues of t+ i 
are bounded below by 



Xi (® t+ i) > li + — - CtAmax(5'), 

and bounded above by 

Xi (©t+i) < li + — ~ CtX min (S) 



The function /(x) = x + — over a < x < b has only one extremum which is a global minimum 
at x — i/Ct- Therefore, 

Ct f 2^ ifa<V<t<& 

mm x H = < , 

a<x<b x y mm ( a _)_ k ) 5 _|_ Ci) otherwise 
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and 

. C* / , Ct , . Ct 

max x H = max a + —,b + — 

a<x<b x \ b b 

Since a < 71 < b, 

Ct 

A mi „(© t+ i) > 71 + — - (Am^(S) 

> min (x + -) - CAmax(5') 

a<x<6 y X / 

2 v / 0-GAmax(5) ifa< V^<& 

min (a + f , 6 + f ) - CtA max (5') otherwise 



Similarly, 



Amax(@ t+ i) < 7p + — - CtAmin('S') 

- 7p 



< max (x + — J -(Amm(S) 

a<x<b \ X J 

= max (a + ^, b + j J - CtA mi „(5'). 



□ 

It remains to demonstrate that the soft-thresholded iterates Qt+i remain bounded in eigen- 
value. 

Lemma 5. Let < a < b and ( t > 0. Then: 

• I 1 Ct 7 . Ct\ . Ct 

mm \ a -\ ,0 + — = a H 

\ a J a 

if and only if ( t < afe- 

Proof. Under the stated assumptions, 

a b \a b J 

b — a 

^Ct<— r 

a fe 

□ 



22 



Lemma 6. Let A be a symmetric p x p matrix. Then the soft-thresholded matrix r] t (A) 
satisfies 

A min (A) -pe< X min {Ve{A)) 
In particular, A e is positive definite if X m i n (A) > pe. 

Proof. Let 

A := {M G M p : M id G {0, 1, -1}}. 
For every e > 0, the matrix A e can be written as 

Ve(A) =A + eiAi + e 2 A 2 + ■■■ + e k A k , 

for some k < (^) + P where Ai G A, e» > and Yli=i e « = e - Now let 

c p := max{|A min (M)| : M G .4}. 

The constant c p is finite since A is a finite set. Since —A G ^4 for every Aei, and since 
|A min (--4)| = |A max (A)|, it follows that 

c p = max{|A max (M)| : M e A}. 

Applying the Gershgorin circle theorem [see, e.g.,[T3] gives c p < p. Since p is an eigenvalue 
of the matrix B such that Bij = 1 for all i, j, it follows that c p = p. 

Recursive application of Weyl's inequality gives that 

A min {Ve{A)) > A min (A) - e\X max (Ai)\ e k \X imK (A k )\ 

k 



^ Amin 

A m ; n (v4) Cp€ 



i 

i=l 



□ 



Recall from Lemma[T]that the eigenvalues of the optimal solution to problem ([T]) are bounded 
below by woir-, — • The following theorem shows that a = wair^, — is a valid bound to ensure 
that ai ^ 6*+i if ai di ®t- 

Lemma 7. Let p > and a = n-oirn — < b'. Assume ai -< Q t -< b' and consider 
r \\s\\2+pp — 1 — 

e t+ i = V( tP (e t - as - e- 1 )) 
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Then for every < ( t < « 2 , al ^ Qt+i- 

Proof. The result follows by combining Lemma [4] and Lemma [6j Notice first that the hy- 
pothesis ( t < a 2 guarantees that ^/Q ^ [a, b'}. Also, from Lemma [5J we have 

mm cH ,0 + — = cH 

since (t < a 2 < Hence, by Lemma [4], 

Amin(@ t+ i ) > min ^« + ^ , b' + - (Am^(S) 

= a + — - ( t \ max (S). 
a 

Now, applying Lemma |6j to ®t+i = ??Ctp(®t+§)' we obtain 

Amin(@t+l) = A min (%p(@ t+ i)) 

> A min (0 t+ i) -pp( t 

> a + — - C*A max (S') - ppCt- 

a 

We therefore have al -< Ot+i whenever 

a + — - CtAmax(5') - pp(t > oc. 
a 

This is equivalent to 

Ct ~ Amax(5') - pp^j > 0. 

Since > 0, this is equivalent to 

- - A max (5') - pp > 0. 
a 

Reorganizing the terms of the previous equation, we obtain that al ^ t+ i if 

1 1 

a < 



Amax(5')+PP \\S\\2+PP' 



□ 



It remains to show that the eigenvalues of the iterates Qt remain bounded above, for all t. 
Lemma 8. Let a = || 5 |j* +pp and let Q < a 2 ,Wt. Then the G-ISTA iterates Qt satisfy 
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e t ± vi, vt, with 6' = ||e;|| 2 + ||e - e;\\ F . 

Proof. By Lemma [7| al ^ Qt for every t. As al ^ 0* (Lemma [TJ, 

A- := min{A min (e t ),A min (e;)} 2 > a 2 , 
for all £. Also, since > and Q < a 2 , 

c* 



max 



6 2 



1 , 



< 1. 



Therefore, by Lemma [3j 



|e t -e;i| F < ||e t _i - e;|| F . 



Applying this result recursively gives 

||©*-©;Lf< ||e„-e;|| F . 

Since || ■ || 2 < || ■ \\f, we therefore have 

||©t|| 2 - ||©;|| 2 < ||©t - e;n 2 < no* - e;\\ F < \\e - e;\\ F , 

and so, 

A m ax(© t ) = ii© t || 2 <ii©;ii 2 + ii©o-©;iiF 

which completes the proof. □ 
A. 4 Additional timing comparisons 

This section provides additional synthetic timing comparisions for p = 500 and p = 5000. In 
addition, two real datasets were investigated. The "estrogen" dataset |27] contains p = 652 
dimensional gene expression data from n = 158 breast cancer patients. The "temp" dataset 
[B] consists of average annual temperature measurements from p = 1732 locations over 
n = 157 years (1850-2006). 
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P 


0.05 


0.10 


0.15 


0.20 


problem 


algorithm 


time/iter 


time/iter 


time/iter 


time/iter 


p = 500 
n = 100 
nnz(ft) = 3% 


n WO*) MO*) 


31.61%/42.76 


19.61%/18.23 


11.08%/8.13 


5.02%/3.06 


glasso 


28.34/11 


10.91/8 


7.08/7 


5.57/6 


QUIC 


8.33/23 


1.98/13 


0.96/11 


0.38/10 


G-ISTA 


4.44/402 


1.14/110 


0.30/38 


0.14/18 


p = 500 
n = 600 
nnz(ft) = 3% 


nnzfft*) / k(Q*„) 
\ pji \ pj 


20.73%/6.62 


3.93%/2.44 


0.90%/1.49 


0.13%/1.20 


glasso 


7.44/6 


4.53/5 


3.45/4 


2.62/3 


QUIC 


1.08/9 


0.17/7 


0.06/5 


0.04/5 


G-ISTA 


0.28/31 


0.10/13 


0.07/9 


0.03/5 


p = 500 
n = 100 
nnz(fi) = 15% 


n WO*) MO*) 
\ p ) i \ pj 


31.36%/46.83 


19.74%/19.93 


11.65%/8.95 


5.45%/3.25 


glasso 


28.61/11 


11.27/8 


7.22/7 


5.34/6 


QUIC 


8.47/23 


2.01/13 


0.73/9 


0.22/7 


G-ISTA 


4.80/466 


1.09/115 


0.28/34 


0. 15/20 


p = 500 
n = 600 
nnz(O) = 15% 


nWO*)/ K (0*) 


24.81%/9.78 


6.36%/2.64 


0.79%/1.28 


0.03%/1.08 


glasso 


8.52/6 


4.59/5 


3.55/4 


2.54/3 


QUIC 


1.56/10 


0.25/7 


0.05/5 


0.03/5 


G-ISTA 


0.50/51 


0.10/13 


0.06/7 


0.02/3 



Table 2: Timing comparisons for p = 500 dimensional datasets, generated as in Section 5.1 





P 


0.02 


0.04 


0.06 


0.08 


problem 


algorithm 


time/iter 


time/iter 


time/iter 


time/iter 


p = 5000 
n = 1000 
nnz(O) = 3% 


nnz(n;)/ K (n* p ) 


26.22%/54.47 


13.68%/23.74 


6.36%/8.69 


2.03%/2.31 


glasso 


30814.29/11 


12612.85/8 


9224.79/7 


6184.84/5 


QUIC 


22547.70/21 


3725.07/11 


946.11/8 


199.48/6 


G-ISTA 


2651.43/575 


417.20/94 


93.33/25 


39.05/11 


p = 5000 
n = 6000 
nnz(r2) = 3% 


nm(n* )/K(W p ) 


12.89%/15.18 


3.23%/3.73 


1.11%/1.60 


0.16%/1.16 


glasso 


10307.26/7 


8725.86/7 


4846.58/4 


3587.35/3 


QUIC 


3108.14/10 


396.60/7 


86.66/5 


21.56/4 


G-ISTA 


268.28/70 


50.17/14 


35.67/10 


28.82/8 


p = 5000 
n = 1000 
nnz(fi) = 15% 


nm(Q* p )/K(Qr p ) 


26.08%/80.04 


13.93%/37.12 


6.91%/16.52 


2.47%/3.08 


glasso 


36302.86/11 


13413.57/8 


9914.41/7 


7408.33/6 


QUIC 


22667.29/21 


4649.99/12 


1329.20/9 


240.25/6 


G-ISTA 


3952.85/849 


701.57/170 


176.11/45 


42.46/12 


p = 5000 
n = 6000 
nnz(ft) = 15% 


nnz(n*)/ K (0*) 


18.65%/27.69 


5.34%/7.26 


0.66%/1.41 


0.03%/1.09 


glasso 


13180.47/7 


9052.77/7 


4842.28/4 


3578.05/3 


QUIC 


6600.91/12 


795.46/8 


59.03/5 


16.10/4 


G-ISTA 


804.93/189 


103.69/23 


36.17/10 


18.87/5 



Table 3: Timing comparisons for p = 5000 dimensional datasets, generated as in Section 5.1 
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P 


0.15 


0.30 


0.45 


0.60 


problem 


algorithm 


time/iter 


time/iter 


time/iter 


time/iter 


p = 682 
n = 158 
Dataset: estrogen 


nnz(0;)/ K (0;) 


5.29%/290.03 


3.39%/88.55 


2.31%/29.69 


1.63%/8.96 


glasso 


106.18/24 


120.18/34 


110.54/35 


40.52/13 


QUIC 


12.36/19 


2.71/11 


1.08/9 


0.54/7 


G-ISTA 


43.96/2079 


11.99/595 


3.23/172 


1.00/53 




P 


0.2 


0.4 


0.6 


0.8 


problem 


algorithm 


time/iter 


time/iter 


time/iter 


time/iter 


p = 1732 
n = 157 
Dataset: temp 


nnz(0*)/«(0* p ) 


2.02%/1075.8 


1.77%/289.63 


1.34%/23.02 


0.22%/2.10 


glasso 


1919.64/31 


2535.86/46 


1144.07/22 


254.14/5 


QUIC 


497.47/18 


103.76/13 


10.16/8 


2.31/7 


G-ISTA 


1221.40/6194 


183.20/819 


30.01/159 


1.78/10 



Table 4: Timing comparisons for the real datasets described above. 
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